An approach to the extraction of the two-photon exchange (TPE) correction from elastic ep
scattering data is presented. The cross section, polarization transfer (PT), and charge asymmetry
data are considered. It is assumed that the TPE correction to the PT data is negligible. The
form factors and the TPE correcting term are given by one multidimensional function approximated
by a feed-forward neural network (NN). To find a model-independent approximation, the Bayesian
framework for NNs is adapted. A large number of different parametrizations is considered. The
optimal model is indicated by the Bayesian algorithm. The obtained fit of the TPE correction
behaves linearly in $\varepsilon$ but has a nontrivial $Q^2$ dependence. A strong dependence of
the TPE fit on the choice of parametrization is observed.
1. INTRODUCTION

The study of elastic ep scattering provides an opportunity to explore the structure of the
proton. From ep cross section data, the magnetic ($G_M$) and electric ($G_E$) proton form
factors (FFs) are obtained via longitudinal-transverse (LT) separation. In the data analysis, it
is convenient to consider the reduced cross section, which, in the one-photon exchange (OPE)
approximation, reads as

$$\sigma_{1\gamma,R}(Q^2,\varepsilon) = \tau G_M^2(Q^2) + \varepsilon G_E^2(Q^2), \qquad (1)$$
$$\tau = Q^2/4M^2, \qquad \varepsilon = \left[1 + 2(1+\tau)\tan^2(\theta/2)\right]^{-1},$$
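As an illustration of Eq. (1), the kinematical variables and the OPE reduced cross section can be evaluated with a few lines of Python. This is only a minimal sketch: the numerical values of the proton mass and magnetic moment, and all function names, are assumptions introduced here and are not taken from the paper. The dipole form is the $G_D$ quoted in the caption of Fig. 3.

```python
import math

M_PROTON = 0.938272  # proton mass in GeV (assumed value, not quoted in the text)
MU_P = 2.793         # proton magnetic moment in nuclear magnetons (assumed value)


def tau(q2):
    """tau = Q^2 / (4 M^2), with Q^2 in GeV^2."""
    return q2 / (4.0 * M_PROTON ** 2)


def epsilon(q2, theta):
    """epsilon = [1 + 2 (1 + tau) tan^2(theta / 2)]^-1, theta in radians."""
    return 1.0 / (1.0 + 2.0 * (1.0 + tau(q2)) * math.tan(theta / 2.0) ** 2)


def dipole(q2):
    """Dipole form factor G_D = 1 / (1 + Q^2 / 0.71 GeV^2)^2."""
    return 1.0 / (1.0 + q2 / 0.71) ** 2


def reduced_cross_section(q2, eps, g_m, g_e):
    """OPE reduced cross section of Eq. (1): tau * G_M^2 + eps * G_E^2."""
    return tau(q2) * g_m ** 2 + eps * g_e ** 2
```

For example, `reduced_cross_section(1.0, epsilon(1.0, 0.5), MU_P * dipole(1.0), dipole(1.0))` gives the dipole-model value at $Q^2 = 1$ GeV$^2$ and $\theta = 0.5$ rad.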




FIG. 1: The 2-(3-2)-3 type network: two input units, one layer of hidden units, and three
output units. The FF sector (gray filled units and dashed connections), in contrast to the
TPE sector (black units and solid connections), is connected only with $Q^2$. Dotted lines
denote the switched-off connections. Solid and dashed lines represent the weight parameters.
The bias unit is not connected with the units from the previous layer, and its signal is equal
to 1, which means that $f_{act}^{bias} = 1$.



where $Q^2$ and $\theta$ are the four-momentum transfer squared and the scattering angle,
respectively.

The FF ratio,

$$\mathcal{R}_{1\gamma}(Q^2) = \mu_p \frac{G_E(Q^2)}{G_M(Q^2)} \qquad (2)$$

($\mu_p$ is the magnetic moment of the proton), can be extracted from the so-called
polarization transfer (PT) measurements. It turns out that there is a systematic discrepancy
between the FF ratio data obtained via the LT separation and those obtained from the PT
measurements. It seems that taking into account the two-photon exchange (TPE) correction,
which is not included in the classical treatment of the radiative corrections, cancels this
discrepancy. Moreover, it is claimed that the TPE correction to the $\mu_p G_E/G_M$ ratio
extracted from the PT measurements is negligible. However, taking the TPE contribution into
account in the LT separation significantly affects the extracted values of the proton FFs.
The reduced cross section is modified by the TPE correction $\Delta C_{2\gamma}(Q^2,\varepsilon)$,
namely,

$$\sigma_{1\gamma+2\gamma,R}(Q^2,\varepsilon) = \sigma_{1\gamma,R}(Q^2,\varepsilon) + \Delta C_{2\gamma}(Q^2,\varepsilon). \qquad (3)$$

The TPE effect has been studied extensively during the past few years; reviews, references,
and more recent studies can be found in the literature.

The dominant part of $\Delta C_{2\gamma}(Q^2,\varepsilon)$ is given by the interference between
the OPE and the TPE amplitudes. Hence, for $e^+p$ scattering,

$$\Delta C_{2\gamma}(e^+p) = -\Delta C_{2\gamma}(e^-p). \qquad (4)$$

Therefore, the magnitude of the TPE term can be evaluated by measuring the ratio of the
$e^+p$ to $e^-p$ elastic cross sections,

$$\mathcal{R}_{\pm}(Q^2,\varepsilon) = \frac{\sigma_R(e^+p)}{\sigma_R(e^-p)}. \qquad (5)$$

A deviation of this function from unity indicates the importance of the TPE effect.

FIG. 2: Logarithm of evidence, see Eq. (D1). Each point in the plot represents the
ln(evidence) obtained for one particular NN architecture.
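The connection between the charge ratio and the TPE term can be made explicit with a short sketch. It assumes that Eq. (3) describes $e^-p$ scattering, so that, by Eq. (4), the $e^+p$ reduced cross section is $\sigma_{1\gamma,R} - \Delta C_{2\gamma}$. The function name is hypothetical.

```python
def positron_electron_ratio(sigma_1g, delta_c):
    """Ratio of e+ p to e- p reduced cross sections.

    Assumes delta_c is the TPE correction for e- p scattering [Eq. (3)],
    which enters e+ p scattering with the opposite sign [Eq. (4)].
    """
    return (sigma_1g - delta_c) / (sigma_1g + delta_c)
```

For a small correction the ratio is approximately $1 - 2\Delta C_{2\gamma}/\sigma_{1\gamma,R}$, so a percent-level TPE term shows up as a deviation from unity of about twice that size.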
A direct prediction of the proton FFs and the TPE correction is a difficult task. One has to
deal with the problems of quantum chromodynamics in the nonperturbative regime. The
successful approaches are rather phenomenological and contain many internal parameters,
which are fixed to reproduce the experimental data.

On the other hand, the existing elastic polarized and unpolarized $e^-p$ and $e^+p$
scattering data cover a kinematical region broad enough to reconstruct the $Q^2$ dependence
of the FFs. Combining the cross section data with the PT measurements and the $e^+p/e^-p$
ratio data allows one to obtain information about the TPE contribution.

The aim of this paper is to find the approximation of 
the FFs and the TPE contribution by mainly relying on 
the experimental data. We reduce the model-dependent 
assumptions to the necessary minimum. 



2. TPE AND NEURAL NETWORKS 

Only three complex FFs, which depend on $Q^2$ and $\varepsilon$, are required to describe the
elastic unpolarized and polarized ep scattering amplitudes. Hence, six real functions have to
be determined from the data.

We assume that the PT ratio data are not affected by the TPE effect. Then, one can show that
only three unknown functions have to be found: two proton FFs and the $\Delta C_{2\gamma}$
correcting term [see Eq. (3)]. Analyses with similar TPE assumptions have been performed by
many groups.

In this paper, we consider the cross section, the PT, and the $e^+p/e^-p$ ratio data.
Considering at least three different data types turned out to be necessary because of the
limited model assumptions about the TPE term.

To approximate the FFs and the TPE function, one has to assume a particular empirical
parametrization. However, the choice of the functional form of the parametrization has an
impact on the fit and its uncertainties, in particular for the TPE contribution. This problem
was not discussed in the previous analyses.

In the approach presented in this paper, fitting the data means constructing a statistical
model with the ability to predict the FFs and the TPE term. We apply the methods of Bayesian
statistics, which allow a model comparison to be performed. Indeed, we consider as many
different data parametrizations as possible, and the best model is indicated by the objective
Bayesian procedure.

In practice, one has to evaluate the probability distribution $\mathcal{P}(\mathrm{model})$ in
the space of all functional parametrizations of the FFs and the TPE contribution. The best
model should maximize this probability.

The magnetic and electric FFs as well as the TPE correction function are correlated: all of
them should be determined by the same underlying fundamental model. Therefore, one can
imagine that there exists a multidimensional function, defined by a set of parameters, which
simultaneously describes $G_M$, $G_E$, and $\Delta C_{2\gamma}$. In this paper, we use
artificial neural networks (ANNs) to approximate this function. We consider a particular type
of ANN, the feed-forward neural network (NN) in the so-called multilayer perceptron (MLP)
configuration.

The experimental data analyzed here depend on either one (only $Q^2$) or two ($Q^2$ and
$\varepsilon$) kinematical variables. Hence, the MLP must map the two-dimensional input space
[$\mathrm{in} = (Q^2, \varepsilon)^T$] to the output space spanned by three functions,
$\mathrm{out} = (G_M, G_E, \Delta C_{2\gamma})^T$.

We consider MLP networks that consist of three layers of units: input, hidden, and output
(see Fig. 1). Each single neuron (unit) of the network calculates its output value as an
activation function $f_{act}$ of the weighted sum of its inputs, $f_{act}\left(\sum_i w_i \mu_i\right)$,
where $w_i$ denotes the $i$-th weight parameter, while $\mu_i$ represents the output value of
the unit from the previous layer.

For the activation functions we take the sigmoid, $\mathrm{sigmoid}(w) = 1/(1 + \exp(-w))$, and
the linear function for the hidden and output units, respectively. It has been proven [25]
that the maps given by networks with one hidden layer and with sigmoid-like activation
functions in this layer are sufficient to approximate any continuous real function.¹ Indeed,
we assume that the FFs and the TPE term are continuous functions of the kinematical
variables. Notice that the efficiency of the approximation depends on the number of hidden
units. In some problems, it might be a very large number.

1 The two-hidden-layer NNs are sufficient to approximate any function.

Additionally, let us mention a useful property of the sigmoid function. Its effective support
is concentrated in the close neighborhood of $w = 0$. With increasing $|w|$, the sigmoid
saturates. This property allows restricting the effective range of the weight parameters.

$G_M$ and $G_E$ depend only on $Q^2$. This property is achieved by a particular choice of the
MLP architecture, namely, some of the connections are erased. As a result, the network is
divided into two sectors: one, called hereafter the FF sector, which is disconnected from the
$\varepsilon$ input, and the second, called hereafter the TPE sector, which is connected with
both input values. The FFs and the TPE correction are still determined by a large subset of
common weights.

An example of the 2-(3-2)-3 network ($\mathcal{A}_{3,2}$) is drawn in Fig. 1. It consists of
two input units, five units in the hidden layer (three units belong to the FF sector, and two
units belong to the TPE sector), and three output units.

The choice of the particular configuration of units and of the number of hidden units defines
the network architecture $\mathcal{A}_{g,t}$. For a given $\mathcal{A}_{g,t}$, the particular
map $\mathcal{N}_{g,t}$ is defined as

$$\mathcal{N}_{g,t}: \mathbb{R}^2 \ni \begin{pmatrix} Q^2 \\ \varepsilon \end{pmatrix} \mapsto \begin{pmatrix} G_M \\ G_E \\ \Delta C_{2\gamma} \end{pmatrix} \in \mathbb{R}^3. \qquad (6)$$

To obtain the network $\mathcal{N}_{g,t}$, the optimal values of the weight parameters have to
be found. The process of establishing them is called the training of the network, and it is
described in the next part of the paper.

To find the optimal network architecture, which approximates the desired map well, the
number of hidden units has to be varied. In this paper, we apply a method that allows the
optimal size of the hidden layer to be estimated.

FIG. 3: (a) (Top panel) $G_E/G_D$, $G_M/\mu_p G_D$ [$G_D = 1/(1 + Q^2/0.71\ \mathrm{GeV}^2)^2$],
and the ratio $\mu_p G_E/G_M$. The predictions of the proton FFs of Arrington et al. are also
shown, together with the PT $\mu_p G_E/G_M$ data. Shaded areas denote $1\sigma$ uncertainty.
(b) (Bottom panel) The $Q^2$ dependence of $\Delta C_{2\gamma}/\sigma_{1\gamma+2\gamma,R}$ at
$\varepsilon = 0.4$ and 0.8. The shaded area denotes the $1\sigma$ uncertainty computed for
the fit at $\varepsilon = 0.4$. The dotted lines denote the TPE term predicted at
$\varepsilon = 0.4$ by the networks that have lower evidence values than the best fit.
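The two-sector architecture described in this section can be sketched in plain Python. This is an illustrative sketch rather than the C++ implementation used for the analysis: the data layout, the function names, and the random initialization range are assumptions made here. The wiring follows the text: FF-sector hidden units see only $Q^2$, TPE-sector units see both inputs, the $G_M$ and $G_E$ outputs read the FF sector only, and the TPE output reads all hidden units.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(g, t, seed=0):
    """Random weights for a 2-(g-t)-3 network (hypothetical data layout)."""
    rng = random.Random(seed)

    def rnd():
        return rng.uniform(-1.0, 1.0)

    hidden = []
    for i in range(g + t):
        # FF-sector units (i < g): the connection to epsilon is erased.
        w_eps = 0.0 if i < g else rnd()
        hidden.append((rnd(), w_eps, rnd()))  # (w_q2, w_eps, bias weight)
    return {
        "g": g,
        "hidden": hidden,
        "G_M": [rnd() for _ in range(g + 1)],      # g FF units + output bias
        "G_E": [rnd() for _ in range(g + 1)],      # g FF units + output bias
        "TPE": [rnd() for _ in range(g + t + 1)],  # all hidden units + bias
    }


def forward(net, q2, eps):
    """Return (G_M, G_E, Delta C_2gamma) for the inputs (Q^2, epsilon)."""
    h = [sigmoid(wq * q2 + we * eps + wb) for wq, we, wb in net["hidden"]]
    g = net["g"]

    def linear(weights, units):
        # Linear output unit: weighted sum of hidden signals plus bias weight.
        return sum(w * u for w, u in zip(weights, units)) + weights[-1]

    return (linear(net["G_M"], h[:g]),
            linear(net["G_E"], h[:g]),
            linear(net["TPE"], h))
```

With this wiring, changing $\varepsilon$ at fixed $Q^2$ leaves the first two outputs untouched and changes only the third, which is the property the erased connections are meant to enforce.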



3. BAYESIAN FRAMEWORK 

MLPs with a larger number of units (and hence many weights) have a better ability to
represent the data. However, overly complex parametrizations tend to reproduce the data
exactly, including their statistical fluctuations. Thus, the generality of the description is
lost, and the data are overfitted. Moreover, complex parametrizations may lead to larger
uncertainties than simple models. On the other hand, overly simple parametrizations are not
capable of encoding all the important information hidden in the measurements, and the fits
described by simple functions might be characterized by underestimated uncertainties.

The task of finding the optimal statistical model, which represents the data accurately
enough but does not overfit them, is known in statistics as the bias-variance trade-off
problem. In the previous global analyses of the ep data, the degree of complexity of the FF
and TPE parametrizations was chosen with the help of phenomenological arguments and common
sense. In this paper, we apply objective Bayesian methods, which allow the complexity of the
FF and TPE correcting functions to be investigated quantitatively.

The Bayesian framework (BF) formulated for NN computations [26] addresses the problems
described above. This approach has already been adapted to approximate the electromagnetic
nucleon FFs [27], and here it is developed to study the TPE effect. The BF was designed to:²

• quantitatively classify the statistical hypotheses;

• objectively choose the best network architecture and, consequently, the number of hidden
units;

• find the optimal values of the weight parameters;

• objectively establish the training parameters, such as the regularization parameter
$\alpha$ (explained below).

Notice that to deal with the overfitting problem one can also use the cross-validation
technique. It is a complementary approach, which has been applied by the NNPDF group for
fitting the parton distribution functions [29].

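At the beginning of the Bayesian analysis, we assume that all possible models are equally
likely,

$$\mathcal{P}(\mathcal{A}_{g,t}) = \mathcal{P}(\mathcal{A}_{g',t'}) \quad \text{for all } (g,t),\ (g',t'), \qquad (7)$$

where $\mathcal{P}(\mathcal{A}_{g,t})$ denotes the prior probability.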

With the help of Bayes' theorem, the posterior probability for a given model (network) is
obtained as

$$\mathcal{P}(\mathcal{A}_{g,t}|\mathcal{D}) = \frac{\mathcal{P}(\mathcal{D}|\mathcal{A}_{g,t})\,\mathcal{P}(\mathcal{A}_{g,t})}{\mathcal{P}(\mathcal{D})}, \qquad (8)$$

where $\mathcal{D}$ is the experimental data, $\mathcal{P}(\mathcal{A}_{g,t}|\mathcal{D})$ is
the probability of the model given the data $\mathcal{D}$, and $\mathcal{P}(\mathcal{D})$ is a
constant real number. Because of the prior assumption (7), to classify the hypotheses it is
enough to evaluate the evidence $\mathcal{P}(\mathcal{D}|\mathcal{A}_{g,t})$, the
probabilistic measure of goodness of fit.
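With equal priors, Eq. (8) reduces the model comparison to a comparison of evidences. The sketch below turns a set of ln(evidence) values into normalized posterior probabilities; the function name and the numbers in the usage note are purely illustrative and are not results of the analysis.

```python
import math


def model_posteriors(log_evidences):
    """Posterior model probabilities for equal priors [Eqs. (7) and (8)].

    log_evidences maps a model label to ln P(D | model). The maximum is
    subtracted before exponentiating to avoid numerical underflow.
    """
    shift = max(log_evidences.values())
    unnormalized = {k: math.exp(v - shift) for k, v in log_evidences.items()}
    norm = sum(unnormalized.values())
    return {k: u / norm for k, u in unnormalized.items()}
```

For instance, `model_posteriors({"A": -10.0, "B": -12.0})` assigns model A a probability larger than that of model B by a factor of $e^2$.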

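For a given network architecture $\mathcal{A}_{g,t}$, the optimal weight parameters $w_{MP}$
should maximize the posterior probability,

$$\mathcal{P}(w|\mathcal{D},\{I\},\mathcal{A}_{g,t}) = \frac{\mathcal{P}(\mathcal{D}|w,\{I\},\mathcal{A}_{g,t})\,\mathcal{P}(w|\{I\},\mathcal{A}_{g,t})}{\mathcal{P}(\mathcal{D}|\{I\},\mathcal{A}_{g,t})}, \qquad (9)$$

where $\mathcal{P}(\mathcal{D}|w,\{I\},\mathcal{A}_{g,t})$ is the likelihood function of the
data, $\mathcal{P}(w|\{I\},\mathcal{A}_{g,t})$ denotes the prior probability, and $\{I\}$
denotes the set of initial constraints.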
The data likelihood function is defined by

$$\mathcal{P}(\mathcal{D}|w,\{I\},\mathcal{A}_{g,t}) \sim \exp\left(-S_{ex}(\mathcal{D},w)\right), \qquad (10)$$

where

$$S_{ex}(\mathcal{D},w) = \chi^2_{\sigma} + \chi^2_{PT} + \chi^2_{\pm} + \chi^2_{G_M} + \chi^2_{G_E} \qquad (11)$$

is the experimental error function. By $\chi^2_{\sigma}$, $\chi^2_{PT}$, and $\chi^2_{\pm}$ we
denote the error functions of the cross section (A1), the PT (A2), and the $e^+p/e^-p$ ratio
(A3) data. Finally, $\chi^2_{G_{M/E}}$ denotes the error function introduced to take the two
artificial FF points into account (see the discussion below).

2 A comprehensive introduction to Bayesian techniques in neural computation can be found in
Ref. [28].

We distinguish between the ANN and the physical initial constraints,
$\{I\} = \{I\}_{ANN} \cup \{I\}_{phys}$.

The ANN constraints $\{I\}_{ANN}$ are introduced to address the overfitting problem. Indeed,
defining the prior probability as

$$\mathcal{P}(w|\{I\}_{ANN},\mathcal{A}_{g,t}) = \mathcal{P}(w|\alpha,\mathcal{A}_{g,t}) \sim \exp\left(-\alpha E_w(w)\right), \qquad (12)$$
$$E_w(w) = \frac{1}{2}\sum_{i \in \text{all weights}} w_i^2, \qquad (13)$$

prevents overfitted parametrizations.

The physical constraints $\{I\}_{phys}$ are motivated by the general properties of the FFs
and the TPE term [19, 23]. We assume that, at $Q^2 = 0$, $G_M/\mu_p = G_E = 1$ and
$\Delta C_{2\gamma}(\varepsilon = 1) = 0$. In practice, three artificial data points are added
to the experimental data sets, namely, $[G_M(0)/\mu_p = 1, \Delta G_M(0) = \Delta]$,
$[G_E(0) = 1, \Delta G_E(0) = \Delta]$, and
$[\mathcal{R}_\pm(0,1) = 1, \Delta\mathcal{R}_\pm(0,1) = \Delta]$, where $\Delta = 0.01$.
The influence of the value of $\Delta$ on the fits and on the training process was
investigated in the preliminary stage of the analysis. It was found that, with
$\Delta < 0.01$, the efficiency of the training process was very low, while $\Delta > 0.01$
was not sufficient to attract the fit to the desired value at the constraint points.

One can show that the maximum of the posterior probability (9) corresponds to the minimum
of the total error function,

$$S_{ex}(\mathcal{D},w) + \alpha E_w(w). \qquad (14)$$

Let $w_{MP}$ denote the weight configuration that minimizes the above expression. To find the
minimum of (14), the quick-prop gradient descent algorithm [30] is applied. The weight
parameters are updated iteratively by the algorithm until the minimum is reached.
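The structure of the minimization can be sketched as follows. Note that this sketch swaps in plain gradient descent with a numerical gradient in place of the quick-prop algorithm used in the paper, and that the learning rate, step count, and function names are assumptions made only for illustration.

```python
def weight_penalty(weights):
    """Regularizer E_w(w) = (1/2) * sum_i w_i^2 of Eq. (13)."""
    return 0.5 * sum(w * w for w in weights)


def total_error(s_ex, weights, alpha):
    """Total error function of Eq. (14): S_ex(w) + alpha * E_w(w)."""
    return s_ex(weights) + alpha * weight_penalty(weights)


def gradient_descent(s_ex, weights, alpha, rate=0.01, steps=2000, h=1e-6):
    """Minimize the total error by plain gradient descent (not quick-prop).

    The gradient is estimated with central finite differences, which is
    slow but keeps the sketch independent of the form of s_ex.
    """
    w = list(weights)
    for _ in range(steps):
        grad = []
        for i in range(len(w)):
            up = w[:]
            up[i] += h
            down = w[:]
            down[i] -= h
            grad.append((total_error(s_ex, up, alpha)
                         - total_error(s_ex, down, alpha)) / (2.0 * h))
        w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w
```

For a one-parameter toy misfit $S_{ex} = (w - 2)^2$, the minimum moves from $w = 2$ at $\alpha = 0$ to $w = 4/(2 + \alpha)$, which shows how the penalty term pulls the weights toward zero.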

The proper choice of the $\alpha$ parameter is crucial for obtaining the fits and for further
model comparison. If the $\alpha$ parameter is small, then the penalty term (13) does not
significantly affect the results of the training process. In the language of Bayesian
statistics, a small value of $\alpha$ corresponds to a large width of the prior probability
distribution (12).

The BF provides a recipe for establishing the optimal $\alpha$ parameter. For the optimal
$\alpha_{MP}$ and $w_{MP}$, the probability distribution

$$\mathcal{P}(\alpha|\mathcal{D},\mathcal{A}_{g,t}) = \frac{\mathcal{P}(\mathcal{D}|\alpha,\mathcal{A}_{g,t})\,\mathcal{P}(\alpha|\mathcal{A}_{g,t})}{\mathcal{P}(\mathcal{D}|\mathcal{A}_{g,t})} \qquad (15)$$

is maximized.
From the above expression, one can obtain the necessary condition (C1), which has to be
satisfied by $\alpha_{MP}$. In this version of the BF approach, we consider the so-called
ladder approximation, where expression (C1) is used to iteratively update the value of
$\alpha$ during the training process until it converges. In practice, because Eq. (C1) is
valid only in the neighborhood of the minimum, the $\alpha$ parameter is not changed until the
training process approaches the close surroundings of the minimum. During this initial part
of the training, $\alpha$ is fixed at $\alpha_0 = 0.01$. Then, it starts to be updated.

In general, for every weight parameter, one could introduce its own regularization factor.
However, notice that $\alpha$ is an example of a scale parameter of the model, and that the
network $\mathcal{A}_{g,t}$ is symmetric under the permutation of units in the hidden layer.³
This property allows reducing the number of independent regularization parameters to six:
three in the FF sector (hidden, bias, and linear regularization factors) and three in the TPE
sector (analogously). On the other hand, $\Delta C_{2\gamma}$ is given by a linear combination
of the nucleon FFs multiplied by additional TPE-like FFs (see Eq. (14) of Ref. [8]). This
means that, if the parameters of the FF sector are scaled, then the weights of the TPE sector
should also be rescaled. This property seems to be only approximate, but we use it to reduce
the number of regularization parameters to three. Finally, our previous paper showed that one
regularization parameter is enough to fit the FF data [27]. Hence, to simplify the numerical
calculations and to accelerate the training process (more than 45 000 training processes have
been performed), we consider the simplest regularization scenario with one $\alpha$
parameter. However, as described above, this part of the approach can be improved [31].

Having the optimal values of the weights and of the $\alpha$ parameter, the evidence is
computed from Eq. (D1). The logarithm of the evidence is given by two main contributions:
the misfit of the approximated data (the experimental error function at the minimum) and the
Occam factor. The latter penalizes complex models. The optimal model has the highest value
of the evidence. The evidence formula and a description of its properties can be found in
Appendix D.



4. NUMERICAL ANALYSIS 

As mentioned in the previous section, we consider three types of measurements: the cross
section (27 sets), the PT (14 sets), and the $e^+p/e^-p$ ratio (3 sets) data.

3 An appropriate permutation of the units in the hidden layer of either the FF and/or the TPE
sector does not change the output response.

FIG. 4: (a) (Top panel) Ratio $\mathcal{R}_\pm$ predicted by the network 2-(5-6)-3. The gray
areas denote $1\sigma$ uncertainty. (b) (Bottom panel) Dependence of
$\Delta C_{2\gamma}/G_D^2$ on $\varepsilon$. The darker and lighter gray areas denote
$1\sigma$ uncertainty.

The selection of the cross section and PT ratio data sets is the same as in one of our
previous papers [16]. However, in the case of the PT data, two data sets are replaced with
their recent updates. Additionally, we include the latest PT measurements of the FF ratio.
Since the presence of the PT ratio data is required to properly extract the TPE contribution,
we only consider the cross section points below $Q^2 = 10$ GeV$^2$. Above this limit, the PT
data are not available.

In the case of the cross section data, similarly to Ref. [16], the systematic normalization
uncertainties are taken into account. For every data set, a normalization parameter is
introduced, and it is established during training. The procedure is described in Appendix B.

We consider networks of type 2-($g$-$t$)-3, where $4 \le g + t = M \le 12$. In the
preliminary stage of the analysis, it was observed that networks with either $g = 1$ or
$t = 1$ were not able to approximate the data well (similarly to networks with $M < 4$).
Therefore, we only consider models with $g, t > 1$. Finally, we discuss 45 different ANN
architectures. For every network type $\mathcal{A}_{g,t}$, $10^3$ networks with randomly
chosen initial values of the weights have been trained. Among them, the parametrization with
the highest evidence was used for further model comparison. It turned out that the highest
evidence value was obtained for the network 2-(5-6)-3 (see Fig. 2).
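The count of 45 architectures follows directly from the constraints $g, t > 1$ and $4 \le g + t \le 12$, as the following short enumeration shows. The function name and default arguments are introduced here for illustration only.

```python
def architectures(m_min=4, m_max=12, min_units=2):
    """All (g, t) pairs with g, t >= min_units and m_min <= g + t <= m_max."""
    return [(g, t)
            for g in range(min_units, m_max + 1)
            for t in range(min_units, m_max + 1)
            if m_min <= g + t <= m_max]
```

Training $10^3$ randomly initialized networks for each of these pairs gives the 45 000 training processes mentioned in Sec. 3.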

In Fig. 3(a), we plot the FF ratio $\mathcal{R}_{1\gamma}$ computed with the network
$\mathcal{A}_{5,6}$ (our best fit). The shaded areas denote the $1\sigma$ uncertainty
computed from the covariance matrix of the fit. Our predictions of the FFs are compared to
the results of Ref. [18], where the TPE function was postulated based on phenomenological
arguments. The discrepancy between our fits and those of Ref. [18] appears above
$Q^2 = 4$ GeV$^2$.

In Fig. 3(b), the $Q^2$ dependence of the ratio
$\Delta C_{2\gamma}/\sigma_{1\gamma+2\gamma,R}$ is presented. We see that at
$Q^2 \sim 0.2$ GeV$^2$ the TPE correcting term has a local minimum, and it becomes a
decreasing function of $Q^2$ above 2 GeV$^2$. With growing $Q^2$, the fit uncertainty also
grows. Indeed, above $Q^2 = 6$ GeV$^2$, the number of experimental points is limited, and the
data are not accurate enough to obtain a precise approximation. It is interesting to mention
that, for large $\varepsilon$ (above 0.8) and $Q^2$ around 1.5 GeV$^2$, the TPE correction is
positive.

In Fig. 3(b), we also plot the TPE contribution (dotted lines) predicted by six other models
$\mathcal{A}_{g,t}$. They are characterized by lower evidence values than the
$\mathcal{A}_{5,6}$ model, but they would be acceptable according to the $\chi^2$ criterion
(their $\chi^2_{min}$ values are much lower than the number of points). The difference between
these fits and the prediction of $\mathcal{A}_{5,6}$ is striking. It demonstrates that the
model comparison is crucial for the proper choice of the TPE parametrization.

Keeping in mind the forthcoming measurements of elastic $e^-p$ and $e^+p$ scattering [32],
in Fig. 4(a) we plot our predictions of the ratio $\mathcal{R}_\pm$. Although we have not
assumed the linearity of the TPE term in $\varepsilon$, the final fit behaves like a linear
function of $\varepsilon$, as observed in the previous global analysis [33]. Nonlinearities
appear at low $\varepsilon$ and $Q^2$ values (bottom panel of Fig. 4); however, in this
kinematical domain the fits have large uncertainties, and the obtained results are in
agreement with the linear approximation.

The obtained TPE function has a particular analytical form (see Appendix E), which can be
written as a Taylor series in $\varepsilon$. If one neglects terms of higher than linear order
in $\varepsilon$, then the TPE correction is the sum of two contributions, which play a
particular role in the LT separation. One of them corrects the magnetic FF, and it appears to
be negative. The other modifies the electric FF, and it is a positive function of $Q^2$.

Notice that the PT data are not present below $Q^2 = 0.16$ GeV$^2$. Hence, their influence on
the extraction of the TPE term in this kinematical range is small. Despite the lack of PT
data, the TPE term is still constrained in the low-$Q^2$ region. Namely, there are several
$e^+p/e^-p$ ratio data points [14]. Additionally, we keep one artificial point at $Q^2 = 0$
and $\varepsilon = 1$, which constrains $\Delta C_{2\gamma}$. Also, there are plenty of cross
section data points and the two artificial FF points. The presence of the FF points
determines the low-$Q^2$ behavior of $\sigma_{1\gamma,R}$. Altogether, this provides
restrictions on the extraction of the TPE term.

In Fig. 4(b), we show the $\varepsilon$ dependence of $\Delta C_{2\gamma}$ at several values
of $Q^2$. It can be seen that at very low $Q^2$, the TPE term becomes positive. However, as
above, because of the large uncertainties, this effect is consistent with
$\Delta C_{2\gamma} = 0$.

The aim of this paper was to find an approximation of the proton FFs and the TPE function
based on the knowledge of the elastic ep scattering data. This was done by adapting the
Bayesian statistical methods developed for feed-forward NNs. We assumed that the TPE
correction does not affect the PT ratio data. This assumption turned out to be necessary to
perform the numerical analysis, but one should keep in mind that it is not a perfect
approximation at low $Q^2$.

We discussed as many different NN parametrizations as required to find the optimal model.
The best model was indicated by the Bayesian algorithm. From this point of view, the results
are model independent. The obtained TPE fit turned out to have a nontrivial $Q^2$
dependence. In some kinematical regions (very low $Q^2$, and $Q^2 \sim 1.5$ GeV$^2$ with
$\varepsilon > 0.8$), it is a positive function.

Let us emphasize that we considered the simplest work- 
ing BF. Only one regularization parameter was discussed, 
and the Hessian approximation was applied. Further de- 
velopment of the approach might improve the results of 
the analysis. In particular, it allows going beyond the co- 
variance matrix approximation used for the estimation of 
the fit uncertainty. The improvements require introduc- 
ing modifications at every step of the BF. The new ap- 
proach will also need greater computational power than 
the present one. 

The analytical form of the fits is shown in Appendix E, whereas the covariance matrix can be
taken from Ref. [34]. All numerical computations have been performed with the C++ library
developed by K.M.G.
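Appendix A: Error Functions

The cross section error function reads as

$$\chi^2_\sigma = \sum_{k=1}^{N_\sigma}\left[\sum_{i=1}^{n_k}\left(\frac{\eta_k\sigma^{th}_{ki} - \sigma^{ex}_{ki}}{\Delta\sigma_{ki}}\right)^2 + \left(\frac{\eta_k - 1}{\Delta\eta_k}\right)^2\right], \qquad (A1)$$

where $N_\sigma$ is the number of independent cross section data sets, $n_k$ is the number of
points in the $k$th data set, $\eta_k$ is the normalization parameter for the $k$th data set,
$\Delta\eta_k$ is the normalization (systematic) uncertainty, $\sigma^{ex}_{ki}$ is the
experimental value of the reduced cross section of the $i$th data point in the $k$th data set
measured for $Q^2_{ki}$ and $\varepsilon_{ki}$, $\Delta\sigma_{ki}$ denotes the corresponding
experimental uncertainty, and
$\sigma^{th}_{ki} = \sigma_{1\gamma+2\gamma,R}(Q^2_{ki}, \varepsilon_{ki})$.

The PT ratio data error function reads as

$$\chi^2_{PT} = \sum_{i=1}^{n_{PT}}\left(\frac{\mathcal{R}^{th}_i - \mathcal{R}^{ex}_i}{\Delta\mathcal{R}^{ex}_i}\right)^2, \qquad (A2)$$

where $n_{PT}$ is the number of PT ratio data points, $\mathcal{R}^{ex}_i$ denotes the
experimental value of the $i$th point, measured for $Q^2_i$, $\Delta\mathcal{R}^{ex}_i$ is the
corresponding experimental uncertainty, and $\mathcal{R}^{th}_i = \mathcal{R}_{1\gamma}(Q^2_i)$.

Analogously, the positron-electron ratio data error function reads as

$$\chi^2_{\pm} = \sum_{i=1}^{n_\pm}\left(\frac{\mathcal{R}^{\pm,th}_i - \mathcal{R}^{\pm,ex}_i}{\Delta\mathcal{R}^{\pm,ex}_i}\right)^2, \qquad (A3)$$

where $n_\pm$ is the number of positron-electron ratio data points, $\mathcal{R}^{\pm,ex}_i$
denotes the experimental value of the $i$th point, measured for the $Q^2_i$ and
$\varepsilon_i$ values, $\Delta\mathcal{R}^{\pm,ex}_i$ is the corresponding experimental
uncertainty, and $\mathcal{R}^{\pm,th}_i = \mathcal{R}_\pm(Q^2_i, \varepsilon_i)$.

The $\chi^2_G$ reads as

$$\chi^2_G = \left(\frac{G(0) - 1}{\Delta}\right)^2, \qquad (A4)$$

where $G = G_M/\mu_p$ or $G_E$.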

Appendix B: Data Normalization

The optimal values of the normalization parameters $\eta_k$, $k = 1, 2, \ldots, N_\sigma$, at
the minimum of the total error function (14), must satisfy the property

$$\frac{\partial S_{ex}}{\partial \eta_k} = 0, \qquad k = 1, \ldots, N_\sigma, \qquad (B1)$$

which can be rewritten as

$$\eta_k = \frac{\sum_{i=1}^{n_k} \sigma^{th}_{ki}\sigma^{ex}_{ki}/(\Delta\sigma_{ki})^2 + 1/(\Delta\eta_k)^2}{\sum_{i=1}^{n_k} (\sigma^{th}_{ki})^2/(\Delta\sigma_{ki})^2 + 1/(\Delta\eta_k)^2}. \qquad (B2)$$

The above expression is used for updating the normalization parameters during training. The
procedure turned out to be convergent, as long as the minimum of the total error function was
reached. It is interesting to notice that the normalization parameters obtained in this paper
are very similar to those obtained in our previous global analysis [16], where the MINUIT
package (now one of the packages of the ROOT library) was applied to find the optimal values
of the fit and normalization parameters.

Appendix C: Regularization Parameter

The $\alpha_{MP}$ parameter is computed in the so-called Hessian approximation [28]. It is
given by the solution of the equation

$$2\alpha_{MP}E_w(w_{MP}) = \sum_{i=1}^{W}\frac{\lambda_i}{\lambda_i + \alpha_{MP}} \equiv \gamma(\alpha_{MP}), \qquad (C1)$$

where the $\lambda_i$'s are the eigenvalues of the matrix
$\nabla_n\nabla_m S_{ex}|_{w = w_{MP}}$ and $\nabla_i = \partial/\partial w_i$.

In practice, to find the optimal $\alpha_{MP}$, the $\alpha$ parameter is iteratively updated
during the training process, i.e.,

$$\alpha_{k+1} = \gamma(\alpha_k)/2E_w(w), \qquad (C2)$$

where $\alpha_k$ denotes the value of the regularization parameter in the $k$th iteration step
of training.
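The fixed-point iteration for the regularization parameter can be sketched as below. The sketch assumes that the quantity $\gamma(\alpha)$ in the update rule is the sum $\sum_i \lambda_i/(\lambda_i + \alpha)$ from the condition for $\alpha_{MP}$, and it holds the Hessian eigenvalues and $E_w$ fixed, whereas in the actual training they change together with the weights. The function names, tolerance, and iteration limit are illustrative.

```python
def gamma(eigenvalues, alpha):
    """gamma(alpha) = sum_i lambda_i / (lambda_i + alpha)."""
    return sum(lam / (lam + alpha) for lam in eigenvalues)


def update_alpha(eigenvalues, alpha, e_w):
    """One step of the update alpha_{k+1} = gamma(alpha_k) / (2 E_w)."""
    return gamma(eigenvalues, alpha) / (2.0 * e_w)


def iterate_alpha(eigenvalues, e_w, alpha0=0.01, tol=1e-10, max_iter=1000):
    """Iterate the update from alpha0 until successive values agree."""
    alpha = alpha0
    for _ in range(max_iter):
        new_alpha = update_alpha(eigenvalues, alpha, e_w)
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

The quantity $\gamma$ lies between 0 and the number of weights and can be read as the number of parameters that are well determined by the data rather than by the prior.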


Appendix D: Evidence 

The logarithm of the evidence, with the terms that are the same for different network
architectures omitted, reads as

$$\ln\mathcal{P}(\mathcal{D}|\mathcal{A}_{g,t}) \approx \underbrace{-S_{ex}(\mathcal{D},w_{MP})}_{\text{misfit}}\ \underbrace{-\ \alpha_{MP}E_w(w_{MP}) - \frac{1}{2}\ln|A| + \frac{W}{2}\ln\alpha_{MP} - \frac{1}{2}\ln\frac{\gamma}{2}}_{\text{Occam factor}}\ +\ \underbrace{(g+t)\ln 2 + \ln(g!) + \ln(t!)}_{\text{symmetry factor}}, \qquad (D1)$$

where $W$ is the number of weight parameters, $\gamma = \gamma(\alpha_{MP})$ is the quantity
appearing in Eq. (C2), and $|A|$ is the determinant of the Hessian matrix
$A_{ij} = \nabla_i\nabla_j S_{ex}|_{w = w_{MP}} + \alpha_{MP}\delta_{ij}$.

The first term in Eq. (D1), usually of low value, is the misfit of the approximated data,
while the other terms contribute to the Occam factor. The latter penalizes complex models.
An additional contribution to this quantity is given by the symmetry factor.

In the network $\mathcal{A}_{g,t}$, some hidden units can be interchanged, but the response
of the network (the output values) remains unchanged. It means that, for a given
configuration of weights, there exists some number of equivalent networks, which differ only
by the appropriate permutation of the weights. It gives rise to an additional combinatorial
factor in the evidence. However, in this paper, it does not play a significant role.
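The bookkeeping of the evidence formula can be sketched as follows. The grouping of terms follows one reading of the partly illegible printed formula (misfit, Occam factor, symmetry factor) and should be treated as an assumption, as should the function names. The weight count is inferred from the connectivity listed in Appendix E for the 2-(5-6)-3 network.

```python
import math


def count_weights(g, t):
    """Number of weights W of a 2-(g-t)-3 network with erased connections.

    FF hidden units: 2 weights each (Q^2, bias); TPE hidden units: 3 each.
    G_M and G_E outputs: g + 1 weights each; TPE output: g + t + 1 weights.
    """
    return 2 * g + 3 * t + 2 * (g + 1) + (g + t + 1)


def log_evidence(s_ex, alpha, e_w, log_det_hessian, n_weights, gamma, g, t):
    """ln(evidence) as the sum of misfit, Occam factor and symmetry factor."""
    misfit = -s_ex
    occam = (-alpha * e_w
             - 0.5 * log_det_hessian
             + 0.5 * n_weights * math.log(alpha)
             - 0.5 * math.log(gamma / 2.0))
    # ln(g!) and ln(t!) via the log-gamma function: ln(n!) = lgamma(n + 1).
    symmetry = ((g + t) * math.log(2.0)
                + math.lgamma(g + 1) + math.lgamma(t + 1))
    return misfit + occam + symmetry
```

In a model scan, this function would be evaluated once per trained network, and the architecture with the largest value would be selected.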

Appendix E: Analytical Form of Fits 



We have

$$\frac{G_M(Q^2)}{\mu_p G_D(Q^2)} = \sum_{i=3}^{7} w_{i,15}\, f_{act}(Q^2 w_{0,i} + w_{2,i}) + w_{14,15}, \qquad (E1)$$

$$\frac{G_E(Q^2)}{G_D(Q^2)} = \sum_{i=3}^{7} w_{i,16}\, f_{act}(Q^2 w_{0,i} + w_{2,i}) + w_{14,16}, \qquad (E2)$$

$$\frac{\Delta C_{2\gamma}(Q^2,\varepsilon)}{G_D^2(Q^2)} = \sum_{i=3}^{13} w_{i,17}\, f_{act}(Q^2 w_{0,i} + \varepsilon w_{1,i} + w_{2,i}) + w_{14,17}, \qquad (E3)$$

$$G_D(Q^2) = \frac{1}{(1 + Q^2/0.71)^2}, \qquad (E4)$$

$$f_{act}(x) = \frac{1}{1 + \exp(-x)}. \qquad (E5)$$

Weights of the TPE output unit:
$w_{3,17} = -0.7994$, $w_{4,17} = 0.9775$, $w_{5,17} = 4.70641$, $w_{6,17} = -0.5378$,
$w_{7,17} = 0.20026$, $w_{8,17} = 0.08842$, $w_{9,17} = -5.25238$, $w_{10,17} = 6.91219$,
$w_{11,17} = -4.09499$, $w_{12,17} = 1.53302$, $w_{13,17} = -1.29911$, $w_{14,17} = -2.32656$.

Weights of the $G_E$ output unit:
$w_{3,16} = 0.87769$, $w_{4,16} = -1.42417$, $w_{5,16} = 5.31229$, $w_{6,16} = -7.03220$,
$w_{7,16} = 1.16534$, $w_{14,16} = 1.64949$.

Weights of the $G_M$ output unit:
$w_{3,15} = 0.06206$, $w_{4,15} = -0.16708$, $w_{5,15} = -2.05062$, $w_{6,15} = -3.18562$,
$w_{7,15} = 1.43697$, $w_{14,15} = 4.91944$.

Weights of the hidden units in the TPE sector:
$w_{0,13} = 0.01442$, $w_{0,12} = 0.14829$, $w_{1,13} = 0.15599$, $w_{1,12} = 0.50796$,
$w_{2,13} = 0.34353$, $w_{2,12} = -0.91625$, $w_{0,11} = 0.41505$, $w_{0,10} = -0.44004$,
$w_{1,11} = -0.13263$, $w_{1,10} = 0.65672$, $w_{2,11} = 1.80434$, $w_{2,10} = 3.66358$,
$w_{0,9} = 0.37908$, $w_{0,8} = 0.01100$, $w_{1,9} = 0.33989$, $w_{1,8} = 0.05041$,
$w_{2,9} = 1.25938$, $w_{2,8} = 0.68383$.

Weights of the hidden units in the FF sector:
$w_{0,7} = 1.02290$, $w_{0,6} = 3.26523$, $w_{2,7} = 0.97734$, $w_{2,6} = 3.63669$,
$w_{0,5} = 0.71814$, $w_{0,4} = 0.25270$, $w_{2,5} = 2.42620$, $w_{2,4} = -1.60333$,
$w_{0,3} = -1.22837$, $w_{2,3} = 1.44729$,
$w_{1,3} = w_{1,4} = w_{1,5} = w_{1,6} = w_{1,7} = 0$.

Notice that $Q^2$ in the above formulas is meant to be in units of GeV$^2$.

The FF and TPE parametrizations are obtained for $Q^2 \in (0, 10\ \mathrm{GeV}^2)$ and
$\varepsilon \in (0, 1)$. However, one should keep in mind the problems of the extraction of
the TPE term at very low $Q^2$ (see the discussion in the last section of the paper).
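The analytical form of the fits can be evaluated with the following Python sketch. The weights are transcribed from the listing in this appendix; a few labels that are illegible there ($w_{0,6}$, $w_{2,3}$, $w_{1,11}$, $w_{1,13}$, $w_{5,16}$) are inferred by elimination and should be treated as assumptions, as should the function names. As a consistency check of the transcription, the fit reproduces the constraints $G_M/\mu_p = G_E = 1$ at $Q^2 = 0$ and $\Delta C_{2\gamma}(Q^2 = 0, \varepsilon = 1) = 0$ to within about one percent.

```python
import math

# Hidden-unit weights keyed by unit index i:
# (w_{0,i} for Q^2, w_{1,i} for epsilon, w_{2,i} for the bias unit).
W_IN = {
    3: (-1.22837, 0.0, 1.44729),
    4: (0.25270, 0.0, -1.60333),
    5: (0.71814, 0.0, 2.42620),
    6: (3.26523, 0.0, 3.63669),
    7: (1.02290, 0.0, 0.97734),
    8: (0.01100, 0.05041, 0.68383),
    9: (0.37908, 0.33989, 1.25938),
    10: (-0.44004, 0.65672, 3.66358),
    11: (0.41505, -0.13263, 1.80434),
    12: (0.14829, 0.50796, -0.91625),
    13: (0.01442, 0.15599, 0.34353),
}
# Output weights keyed by hidden-unit index; key 14 is the output bias.
W_GM = {3: 0.06206, 4: -0.16708, 5: -2.05062, 6: -3.18562,
        7: 1.43697, 14: 4.91944}
W_GE = {3: 0.87769, 4: -1.42417, 5: 5.31229, 6: -7.03220,
        7: 1.16534, 14: 1.64949}
W_TPE = {3: -0.7994, 4: 0.9775, 5: 4.70641, 6: -0.5378, 7: 0.20026,
         8: 0.08842, 9: -5.25238, 10: 6.91219, 11: -4.09499,
         12: 1.53302, 13: -1.29911, 14: -2.32656}


def f_act(x):
    return 1.0 / (1.0 + math.exp(-x))


def dipole(q2):
    """G_D(Q^2) with Q^2 in GeV^2."""
    return 1.0 / (1.0 + q2 / 0.71) ** 2


def _output(weights, q2, eps):
    total = weights[14]
    for i, w in weights.items():
        if i != 14:
            w0, w1, w2 = W_IN[i]
            total += w * f_act(q2 * w0 + eps * w1 + w2)
    return total


def gm_over_mu_gd(q2):
    """G_M / (mu_p G_D); FF-sector units do not depend on epsilon."""
    return _output(W_GM, q2, 0.0)


def ge_over_gd(q2):
    """G_E / G_D."""
    return _output(W_GE, q2, 0.0)


def tpe_over_gd2(q2, eps):
    """Delta C_2gamma / G_D^2."""
    return _output(W_TPE, q2, eps)
```

The TPE term itself follows as `tpe_over_gd2(q2, eps) * dipole(q2) ** 2`; the parametrization is meant for $Q^2$ between 0 and 10 GeV$^2$ and $\varepsilon$ between 0 and 1.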